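We prove that every online-learnable class of functions of Littlestone dimension $d$ admits a learning algorithm with finite information complexity. To this end, we use the notion of a globally stable algorithm. In general, the information complexity of such a globally stable algorithm is large yet finite, roughly exponential in $d$. We also show that there is room for improvement: for a canonical online-learnable class, the indicator functions of affine subspaces of dimension $d$, the information complexity can be upper bounded logarithmically in $d$.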
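How does the geometric representation of a dataset change after the application of each randomly initialized layer of a neural network? The celebrated Johnson-Lindenstrauss lemma answers this question for linear fully-connected neural networks (FNNs), stating that the geometry is essentially preserved. For FNNs with ReLU activation, the angle between two inputs contracts according to a known mapping. The question becomes more intricate for non-linear convolutional neural networks (CNNs). To answer it, we introduce a geometric framework. For linear CNNs, we show that the Johnson-Lindenstrauss lemma continues to hold, namely, that the angle between two inputs is preserved. For CNNs with ReLU activation, on the other hand, the behavior is richer: the angle between the outputs contracts, where the level of contraction depends on the nature of the inputs. In particular, after one layer, the geometry of natural images is essentially preserved, whereas for Gaussian correlated inputs, CNNs exhibit the same contracting behavior as FNNs with ReLU activation.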
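As an illustrative check of the two fully-connected cases described above (not code from the paper), the sketch below passes two inputs at a chosen angle through a single random Gaussian layer, once without activation and once with ReLU. The input dimension, layer width, and helper names are assumptions made for the example. For the ReLU prediction we assume that the "known mapping" is the degree-1 arc-cosine kernel formula, $\cos\theta' = (\sin\theta + (\pi - \theta)\cos\theta)/\pi$.

```python
import numpy as np


def angle(u, v):
    """Angle in radians between vectors u and v."""
    c = np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))
    return float(np.arccos(np.clip(c, -1.0, 1.0)))


def relu_angle_map(theta):
    """Assumed wide-layer limit of the output angle after a random ReLU layer."""
    c = (np.sin(theta) + (np.pi - theta) * np.cos(theta)) / np.pi
    return float(np.arccos(np.clip(c, -1.0, 1.0)))


def random_layer_angles(theta, dim=32, width=50_000, seed=0):
    """Return (linear, relu) output angles for two unit inputs at angle theta."""
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    y = np.zeros(dim)
    x[0] = 1.0
    y[0], y[1] = np.cos(theta), np.sin(theta)
    W = rng.standard_normal((width, dim))  # i.i.d. Gaussian weights, no bias
    hx, hy = W @ x, W @ y
    return angle(hx, hy), angle(np.maximum(hx, 0.0), np.maximum(hy, 0.0))


theta_in = np.pi / 2
linear_out, relu_out = random_layer_angles(theta_in)
print(f"input angle     : {theta_in:.3f}")
print(f"linear layer    : {linear_out:.3f}")
print(f"ReLU layer      : {relu_out:.3f}")
print(f"ReLU prediction : {relu_angle_map(theta_in):.3f}")
```

With a wide layer, the linear output angle should stay close to the input angle, while the ReLU output angle should be smaller and close to the predicted value. The convolutional cases studied in the abstract are not covered by this sketch.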
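We study the implicit bias of ReLU neural networks trained by a variant of SGD in which, at each step, the label is changed with probability $p$ to a random label (label smoothing being a close variant of this procedure). Our experiments demonstrate that label noise propels the network toward a sparse solution in the following sense: for a typical input, only a small fraction of neurons are active, and the firing pattern of the hidden layers is sparser. In fact, in some instances, an appropriate amount of label noise not only sparsifies the network but also reduces the test error. We then turn to a theoretical analysis of such sparsification mechanisms, focusing on the extremal case $p = 1$. We show that in this case the network withers as anticipated from the experiments, but, surprisingly, in different ways that depend on the learning rate and the presence of bias, with either weights vanishing or neurons ceasing to fire.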
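The label-noise step described above can be sketched as follows. This is a minimal illustration rather than the authors' training code: the function name is hypothetical, and we assume the random label is drawn uniformly over all classes, so it may coincide with the original label.

```python
import random


def noisy_label(label, num_classes, p, rng=random):
    """With probability p, replace `label` by a label drawn uniformly at random.

    Assumption: the replacement is uniform over all classes, so it can
    happen to equal the original label.
    """
    if rng.random() < p:
        return rng.randrange(num_classes)
    return label


rng = random.Random(0)
labels = [0, 1, 2, 1, 0, 2, 2, 1]
for p in (0.0, 0.3, 1.0):
    noisy = [noisy_label(y, num_classes=3, p=p, rng=rng) for y in labels]
    print(p, noisy)
```

In a training loop, such a function would be applied to each label before the SGD update. Setting `p=1.0` corresponds to the extremal case analyzed in the abstract, where every label is replaced.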
Training of neural networks is a computationally intensive task. The significance of understanding and modeling the training dynamics is growing as increasingly large networks are being trained. We propose in this work a model based on the correlation of the parameters' dynamics, which dramatically reduces the dimensionality. We refer to our algorithm as \emph{correlation mode decomposition} (CMD). It splits the parameter space into groups of parameters (modes) which behave in a highly correlated manner through the epochs. We achieve a remarkable dimensionality reduction with this approach, where networks like ResNet-18, transformers and GANs, containing millions of parameters, can be modeled well using just a few modes. We observe that the typical time profile of each mode is spread throughout the network, across all layers. Moreover, our model induces regularization which yields better generalization capacity on the test set. This representation enhances the understanding of the underlying training dynamics and can pave the way for designing better acceleration techniques.
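To make the idea of grouping parameters by correlated dynamics concrete, here is a minimal sketch on synthetic trajectories. It is not the authors' CMD implementation: the use of pre-selected reference trajectories, the affine fit $w_i(t) \approx a_i r_m(t) + b_i$, the synthetic time profiles, and all function names are assumptions made for illustration.

```python
import numpy as np


def correlation_modes(trajectories, reference_idx):
    """Assign each parameter trajectory to its most correlated reference.

    trajectories: array (num_params, num_epochs), one row per parameter.
    reference_idx: rows used as mode representatives (assumed to be given).
    Returns (mode, a, b) with w_i(t) ~ a_i * ref_{mode_i}(t) + b_i.
    """
    W = np.asarray(trajectories, dtype=float)
    refs = W[reference_idx]
    Wc = W - W.mean(axis=1, keepdims=True)
    Rc = refs - refs.mean(axis=1, keepdims=True)
    # Pearson correlation between every parameter and every reference.
    corr = (Wc @ Rc.T) / (
        np.linalg.norm(Wc, axis=1, keepdims=True) * np.linalg.norm(Rc, axis=1) + 1e-12
    )
    mode = np.argmax(np.abs(corr), axis=1)
    # Least-squares affine fit of each parameter to its assigned reference.
    R_assigned = Rc[mode]
    a = np.sum(Wc * R_assigned, axis=1) / (np.sum(R_assigned**2, axis=1) + 1e-12)
    b = W.mean(axis=1) - a * refs[mode].mean(axis=1)
    return mode, a, b


# Synthetic "training run": 200 parameters that follow one of two time profiles.
rng = np.random.default_rng(0)
t = np.linspace(0.0, 1.0, 50)
profiles = np.stack([np.exp(-5.0 * t), np.sin(2.0 * np.pi * t)])
true_mode = rng.integers(0, 2, size=200)
scale = rng.uniform(0.5, 2.0, size=200) * rng.choice([-1.0, 1.0], size=200)
offset = rng.normal(size=200)
W = scale[:, None] * profiles[true_mode] + offset[:, None]
W += 0.01 * rng.normal(size=W.shape)

# For the demo, take the first parameter of each true mode as its reference.
refs = [int(np.argmax(true_mode == 0)), int(np.argmax(true_mode == 1))]
mode, a, b = correlation_modes(W, refs)
recon = a[:, None] * W[refs][mode] + b[:, None]
rel_err = np.linalg.norm(recon - W) / np.linalg.norm(W)
print(f"relative reconstruction error: {rel_err:.4f}")
```

Here 200 trajectories are described by two reference trajectories plus two scalars per parameter, which is the kind of dimensionality reduction the abstract refers to. How references are actually chosen in CMD is not specified in the abstract and is left open here.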
By transferring knowledge from large, diverse, task-agnostic datasets, modern machine learning models can solve specific downstream tasks either zero-shot or with small task-specific datasets to a high level of performance. While this capability has been demonstrated in other fields such as computer vision, natural language processing or speech recognition, it remains to be shown in robotics, where the generalization capabilities of the models are particularly critical due to the difficulty of collecting real-world robotic data. We argue that one of the keys to the success of such general robotic models lies with open-ended task-agnostic training, combined with high-capacity architectures that can absorb all of the diverse, robotic data. In this paper, we present a model class, dubbed Robotics Transformer, that exhibits promising scalable model properties. We verify our conclusions in a study of different model classes and their ability to generalize as a function of the data size, model size, and data diversity based on a large-scale data collection on real robots performing real-world tasks. The project's website and videos can be found at robotics-transformer.github.io
Graphs are a highly generic and diverse representation, suitable for almost any data processing problem. Spectral graph theory has been shown to provide powerful algorithms, backed by solid linear algebra theory. It can thus be extremely instrumental to design deep network building blocks with spectral graph characteristics. For instance, such a network allows the design of optimal graphs for certain tasks or obtaining a canonical orthogonal low-dimensional embedding of the data. Recent attempts to solve this problem were based on minimizing Rayleigh-quotient type losses. We propose a different approach of directly learning the eigenspace. A severe problem of the direct approach, applied in batch-learning, is the inconsistent mapping of features to eigenspace coordinates in different batches. We analyze the degrees of freedom of learning this task using batches and propose a stable alignment mechanism that can work both with batch changes and with graph-metric changes. We show that our learnt spectral embedding is better in terms of NMI, ACC, Grassmann distance, orthogonality and classification accuracy, compared to SOTA. In addition, the learning is more stable.
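The inconsistency between batches mentioned above stems from the fact that an eigenspace basis is only defined up to an orthogonal change of coordinates. The sketch below illustrates this on a small random graph, using the geodesic Grassmann distance (computed from principal angles) and an orthogonal Procrustes alignment. It is a generic illustration, not the alignment mechanism proposed in the paper; the graph, the sizes, and the function names are assumptions.

```python
import numpy as np


def laplacian_eigenspace(A, k):
    """Eigenvectors of L = D - A for the k smallest eigenvalues."""
    L = np.diag(A.sum(axis=1)) - A
    _, vecs = np.linalg.eigh(L)  # eigenvalues are returned in ascending order
    return vecs[:, :k]


def grassmann_distance(U, V):
    """Geodesic distance between the column spaces of orthonormal U and V."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    theta = np.arccos(np.clip(s, -1.0, 1.0))  # principal angles
    return float(np.linalg.norm(theta))


def procrustes_align(U, V):
    """Return U @ Q, where orthogonal Q minimizes ||U Q - V||_F."""
    P, _, Qt = np.linalg.svd(U.T @ V)
    return U @ (P @ Qt)


rng = np.random.default_rng(0)
n, k = 30, 3
A = (rng.random((n, n)) < 0.2).astype(float)
A = np.triu(A, 1)
A = A + A.T  # symmetric adjacency with zero diagonal
U = laplacian_eigenspace(A, k)

# Simulate a second batch that spans the same subspace in different coordinates.
R, _ = np.linalg.qr(rng.standard_normal((k, k)))
V = U @ R

print("Grassmann distance  :", grassmann_distance(U, V))
print("coordinate mismatch :", np.linalg.norm(U - V))
print("after alignment     :", np.linalg.norm(procrustes_align(U, V) - V))
```

The Grassmann distance is insensitive to the change of basis, which is why it is a natural metric for comparing learnt eigenspaces, while the raw coordinates only agree after alignment.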
Using massive datasets to train large-scale models has emerged as a dominant approach for broad generalization in natural language and vision applications. In reinforcement learning, however, a key challenge is that available data of sequential decision making is often not annotated with actions - for example, videos of game-play are much more available than sequences of frames paired with their logged game controls. We propose to circumvent this challenge by combining large but sparsely-annotated datasets from a \emph{target} environment of interest with fully-annotated datasets from various other \emph{source} environments. Our method, Action Limited PreTraining (ALPT), leverages the generalization capabilities of inverse dynamics modelling (IDM) to label missing action data in the target environment. We show that utilizing even one additional environment dataset of labelled data during IDM pretraining gives rise to substantial improvements in generating action labels for unannotated sequences. We evaluate our method on benchmark game-playing environments and show that we can significantly improve game performance and generalization capability compared to other approaches, using annotated datasets equivalent to only $12$ minutes of gameplay. Highlighting the power of IDM, we show that these benefits remain even when target and source environments share no common actions.
Model-based reinforcement learning (RL) methods are appealing in the offline setting because they allow an agent to reason about the consequences of actions without interacting with the environment. Prior methods learn a 1-step dynamics model, which predicts the next state given the current state and action. These models do not immediately tell the agent which actions to take, but must be integrated into a larger RL framework. Can we model the environment dynamics in a different way, such that the learned model does directly indicate the value of each action? In this paper, we propose Contrastive Value Learning (CVL), which learns an implicit, multi-step model of the environment dynamics. This model can be learned without access to reward functions, but nonetheless can be used to directly estimate the value of each action, without requiring any TD learning. Because this model represents the multi-step transitions implicitly, it avoids having to predict high-dimensional observations and thus scales to high-dimensional tasks. Our experiments demonstrate that CVL outperforms prior offline RL methods on complex continuous control benchmarks.
In offline reinforcement learning (RL), a learner leverages prior logged data to learn a good policy without interacting with the environment. A major challenge in applying such methods in practice is the lack of both theoretically principled and practical tools for model selection and evaluation. To address this, we study the problem of model selection in offline RL with value function approximation. The learner is given a nested sequence of model classes to minimize squared Bellman error and must select among these to achieve a balance between approximation and estimation error of the classes. We propose the first model selection algorithm for offline RL that achieves minimax rate-optimal oracle inequalities up to logarithmic factors. The algorithm, ModBE, takes as input a collection of candidate model classes and a generic base offline RL algorithm. By successively eliminating model classes using a novel one-sided generalization test, ModBE returns a policy with regret scaling with the complexity of the minimally complete model class. In addition to its theoretical guarantees, it is conceptually simple and computationally efficient, amounting to solving a series of square loss regression problems and then comparing relative square loss between classes. We conclude with several numerical simulations showing it is capable of reliably selecting a good model class.
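Proponents of the Distributed Morphology framework have posited two levels of morphological word formation: a lower one, leading to loose input-output semantic relationships, and an upper one, leading to tight input-output semantic relationships. In this work, we propose to test the validity of this assumption in the context of Hebrew word embeddings. If the two-level hypothesis is borne out, we expect state-of-the-art Hebrew word embeddings to encode (1) a noun, (2) a denominal derived from it (via an upper-level operation), and (3) a verb related to the noun (via a lower-level operation on the noun's root), in such a way that the denominal (2) is closer in the embedding space to the noun (1) than the related verb (3) is to the same noun (1). We report that this hypothesis is verified by four embedding models of Hebrew: fastText, GloVe, Word2Vec and AlephBERT. This suggests that word embedding models are able to capture complex and fine-grained semantic properties that are morphologically motivated.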
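The comparison at the heart of this hypothesis can be expressed as a simple similarity test on a (noun, denominal, verb) triple. The sketch below assumes cosine similarity as the notion of closeness, which the abstract does not specify, and uses made-up three-dimensional vectors; no real Hebrew embeddings or words are involved, and the function names are hypothetical.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def supports_two_level_hypothesis(noun, denominal, root_verb):
    """True if the denominal is closer to the noun than the root-derived verb is."""
    return cosine_similarity(noun, denominal) > cosine_similarity(noun, root_verb)


# Hypothetical toy vectors, invented purely for illustration.
noun = [1.0, 0.2, 0.0]
denominal = [0.9, 0.3, 0.1]
root_verb = [0.3, 0.9, 0.4]

print(supports_two_level_hypothesis(noun, denominal, root_verb))
```

In an actual evaluation, the three vectors would be looked up in each embedding model for every triple, and the fraction of triples passing the test would be reported.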